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We model the performance of an ideal closed chain of L processing elements that work in parallel 
in an asynchronous manner. Their state updates follow a generic conservative algorithm. The con- 
servative update rule determines the growth of a virtual time surface. The physics of this growth is 
reflected in the utilization (the fraction of working processors) and in the interface width. We show 
that it is possible to make an explicit connection between the utilization and the microscopic struc- 
ture of the virtual time interface. We exploit this connection to derive the theoretical probability 
distribution of updates in the system within an approximate model. It follows that the theoretical 
lower bound for the computational speed-up is s = (L + l)/4 for L > 4. Our approach uses simple 
statistics to count distinct surface configuration classes consistent with the model growth rule. It 
enables one to compute analytically microscopic properties of an interface, which are unavailable by 
continuum methods. 

PACS numbers: 89.20.-a, 89.20.Ff, 05.10.-a, 02.50.Fz 



I. INTRODUCTION 

In discrete event simulations a physical system with 
stochastic dynamics is modeled on a lattice of discrete 
points, and changes of its state are viewed as discrete 
events in time. Physical processes interact with each 
other at various points in simulation time. The stochas- 
tic nature of these interactions makes it difficult to uti- 
lize a parallel computing environment to the fullest ex- 
tent because a priori there is no global clock to synchro- 
nize physical processes. Examples of such complex sys- 
tems with underlying asynchronous dynamics come from 
a wide range of fields, such as, activated processes in 
chemistry, contact processes in epidemiology and ecol- 
ogy models, population dynamics, finance markets, com- 
munication networks and internet traffic, to mention a 
few. In physics an important example is an interact- 
ing spin system, where stochastic processes can be sim- 
ulated with a dynamic Monte Carlo approach. Until re- 
cently a common belief in the physics community was 
that even the simplest random-site update Monte Carlo 
schemes Q were inherently serial. A popular paralleliza- 
tion technique for these systems is the so-called "trivial 
parallelization" , in which each processor carries a copy 
of the full system. An obvious limitation of this tech- 
nique is imposed by the memory requirement, which may 
exceed available resources for a large-scale simulation. 
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In non-trivial parallelization, a system is spatially parti- 
tioned into subsystems, and each subsystem is placed on 
a different processor. In other words, in this way phys- 
ical processes and physical interactions between subsys- 
tems are mapped to logical processes and logical depen- 
dences between processing elements. Each logical pro- 
cess manages the state of the assigned physical subsystem 
and progresses in its own local virtual time (LVT). The 
asynchronous nature of the physical dynamics implies an 
asynchronous system of logical processes where discrete 
events are not synchronized by a global clock. Logi- 
cal processes execute concurrently and exchange time- 
stamped messages to perform state updates of the entire 
physical system being simulated. A sufficient condition 
for preserving causality in simulations (the so-called lo- 
cal causality constraint) requires that each logical process 
works out the received messages from other logical pro- 
cesses in nondecreasing time-stamp order 0, ■ 

Parallel discrete event simulations (PDES) are clas- 
sified in two broad categories: conservative PDES and 
optimistic PDES. In conservative PDES, originally stud- 
ied by Chandry and Misra 0, El and introduced by 
Lubachevsky in the study of dynamic Ising spin systems 
[El El , an algorithm does not allow a logical process to ad- 
vance its LVT (i.e., to proceed with computations) until 
it is certain that no causality violation can occur. In 
the conservative update scenario a logical process may 
have to be blocked and wait to ensure that no message 
with a lower time-stamp is received later. Recent physics 
applications of conservative PDES in modeling magneti- 
zation switching Q, ballistic particle deposition an d 
a dynamic phase transition in highly anisotropic thin- 
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film ferromagnets [jj, [^} suggest that the conservative 
algorithm should be very efficient in simulating the dy- 
namics of complex systems with short-range interactions. 
In optimistic PDES [U [H ELE1 0> originated by 
Jefferson's time warp algorithm [llj , an algorithm allows 
a logical process to advance its LVT regardless of the 
possibility of a causality error that may happen in the 
case of receiving a message with a lower time-stamp than 
the local clock. The optimistic scenario detects causal- 
ity errors and provides a recovery procedure from the 
violation of the local causality constraint by rolling back 
the events that have been processed prematurely. Al- 
though there are no general performance studies to date 
that would provide an unbiased comparison of the two 
groups of algorithms, a common perception is that an op- 
timistic PDES should outperform a conservative PDES. 
However, in the context of physics applications to Ising 
spin systems, recent numerical studies by Sloot et al. 
[16j demonstrate that near the Ising critical temperature, 
where long-range correlations occur in the physical spin 
system being modeled, the computational complexity of 
an optimistic PDES and the physical complexity of the 
modeled system are entangled, leading to a nonlinear in- 
crease of the roll-back length and a sudden deterioration 
of the run-time behavior when the number of computing 
processors is increased. 

There are several aspects of PDES algorithms that 
should be considered in systematic efficiency studies. 
Some important aspects are: the synchronization pro- 
cedures, the utilization of the parallel environment as 
measured by the fraction of working processors, memory 
requirements, inter-processor communications handling, 
scalability as measured by evaluating the performance 
when the number of computing processors becomes large, 
and the speed-up as measured by comparing the perfor- 
mance with sequential DES. In routinely performed stud- 
ies to date, the efficiency is investigated in a heuristic 
fashion by testing the performance of a selected appli- 
cation in a chosen PDES environment (i.e ., in a parallel 
simulator). Recently, Korniss et al. [ItJ introduced a 
novel and powerful approach in which a PDES algorithm 
can be studied in an abstract way by extracting key fea- 
tures of the algorithm, simulating its performance, and 
applying the methods of non-equilibrium surface growth 
|18| to evaluate its theoretical efficiency. In the Kor- 
niss et al. approach, the main concept is the simulated 
time horizon (STH), defined as the collection of LVTs 
of all logical processes. The growth rule of this virtual 
time surface is defined by the communication rule among 
logical processes (i.e., by their communication topology, 
which in turn is defined by the underlying dynamics of 
the physical system being simulated) and by the way in 
which the algorithm handles the advances in LVTs. In 
this picture, the utilization of the parallel environment is 
evaluated as the mean density of local update sites of the 
growing time interface, and the width of the interface at 
saturation provides a measure of desynchronization that 
is directly related to the memory requirements [19j . Scal- 



ability properties of a PDES algorithm can be assessed 
from these performance simulation studies 0, 0, |2(J • 

In the study of the STH generated by a conservative 
PDES [HI, it has been determined that in the worst- 
case conservative scenario for a closed spin chain, when 
each processing element (PE) carries only one spin site 
(i.e., each logical process simply corresponds to the flip- 
ping of one spin) and communicates only with its near- 
est neighbors, the time evolution of the STH on coarse- 
grained scales is governed by the Kardar-Parisi-Zhang 
(KPZ) stochastic equation |2l|. This proves, by univer- 
sality arguments, that the simulation phase of conserva- 
tive PDES is asymptotically scalable, which guarantees 
a nonzero utilization even for an infinite number of PEs. 
Using the same argument, it has been shown that the 
STH becomes infinitely rough in the limit of an infinite 
number of PEs, which suggests possible difficulties with 
data management. Thus, the measurement phase of con- 
servative PDES is not asymptotically scalable [ijj. Re- 
cent simulation studies [2(J show that conservative PDES 
can be made fully scalable when the algorithm is sup- 
plemented with either a moving time window constraint 
|22l |23| or additional scale-free communication patterns 
between PEs 0. 

From the physics point of view, the virtual time sur- 
face of the generic conservative PDES, with its morphol- 
ogy and dynamics, can be viewed as a surface growing 
through deposition of random time increments in accor- 
dance with a growth rule defined by a generic conser- 
vative PDES update rule. The physics of this growth 
is reflected in the utilization (the fraction of non-idling 
PEs) that corresponds to the mean number of deposi- 
tion events on the surface. In the case of a closed spin 
chain this is equivalent to the mean density of local min- 
ima in the interface. It should be possible, at least for 
steady-state simulations, to make an explicit connection 
between the utilization and the microscopic structure of 
the interface. Such a connection would enable rigorous 
studies of the update statistics and a closed theoretical 
formula for the utilization. The coarse-grained methods 
previously applied to this problem provide a proof 
of asymptotic scaling properties in the limit of a large 
number of PEs. Because of their continuum nature they 
cannot give a detailed microscopic description of the in- 
terface; neither is it certain if their results are valid for 
statistically feasible moderate to large numbers of PEs. 
On the other hand, the mean utilization strictly depends 
on the microscopic structure of the STH. In this paper we 
explore the connection between the STH interface mor- 
phology on the micro-scale and the update statistics by 
addressing the above questions. Recently, similar con- 
nections have been established between the interface mi- 
crostructure and its mobility for Ising and solid-on-solid 
models with various dynamics [2^, |2(| ■ 

Section [H] outlines the simulation algorithm for mod- 
eling the generic conservative PDES of spatially decom- 
posable cellular automata when each PE carries N lattice 
sites. The steady-state update statistics for N = 1 is ana- 
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lyzed in Sec. lIIII Here we derive formulas for the theoreti- 
cal utilization and the theoretical probability distribution 
of updates in the system within an approximate model. 
Our approach uses simple statistics to build and to count 
distinct surface configuration classes consistent with a 
model update rule (or deposition rule). The idea may be 
generally applied to any surface that grows on a lattice by 
a known growth rule. When the growth recipe is known, 
it is possible to construct diagrams of local lattice-site 
configurations and to translate the rule to dependences 
among the graphs. Then the event probability on the 
surface is deduced from the corresponding diagram of 
possible surface-configuration classes. The performance 
of conservative algorithms is discussed in Sec. IIVI where 
the results of Sec. Iffll are applied to estimate the the- 
oretical computational speed-up for the ideal system of 
PEs in a ring communication topology. In Sec. we 
discuss generalizations of our approach to other growth 
processes and advantages that follow in terms of practi- 
cal applications such as, e.g., the possibility of computing 
closed-form expressions for quantities that would be un- 
available by standard approaches. 



II. MODEL SIMULATIONS OF CONSERVATIVE 
UPDATE EVENTS 

We consider an ideal system of L processors, arranged 
on a ring (Fig. As an ideal system we understand a 
system of identical PEs, where communications between 
PEs take place instantaneously. Each PE carries N lat- 
tice sites, iVf, of which are border sites and (N — Nb) 
are interior sites (where all immediate lattice neighbors 
reside on the same PE) . On each PE the simulation algo- 
rithm randomly selects one of the N sites. If the selected 
site is a border site, the PE is required to communicate 
with its immediate neighbor(s) in an update attempt. If 
an interior site is selected, the update happens without 
communication between PEs. For this system, a discrete 
event means an update attempt. The state of the system 
does not change between update attempts. Processing 
elements perform operations concurrently. However, up- 
date attempts are not synchronized by a global clock. 

An example of the kind of system described above is a 
large, spatially extended ensemble of spins, arranged on 
a regular lattice, with a concurrent operation of random 
Monte Carlo spin-flip attempts. In this picture, the en- 
semble is spatially decomposed into L subsystems, each 
of which carries N spin sites. Each subsystem is placed 
on a PE, and the required communication is the exchange 
of information about states of the border spins (Fig. QJ. 
In the simplest case of N = 1, the system is a closed 
spin chain, and the spin-flip attempt at the fcth PE de- 
pends on the two nearest-neighbor spins located on the 
(fc - l)th and the (fc + l)th PEs. The fcth PE is not 
allowed to update until it receives information from the 
neighboring PEs. For general N, a sublattice assigned 
to a PE has Nb border spins. However, for example in 



Monte Carlo simulations, at each update attempt only 
one of the border sites may be randomly selected at a 
time: either a site from the left border slice or a site 
from the right border slice. Therefore, considering com- 
munications between logical processes, there are only two 
effective border sites per PE when N > 2. The case when 
N > 1 and the effective Nb = 1 , is realized when on each 
PE the left and the right border slices coincide. This 
case is equivalent to a closed spin chain, i.e., to the case 
of N= 1. 

In generic conservative PDES, to simulate asyn- 
chronous dynamics employing L processors, the kth PE 
generates its own local simulated time for the next 
update attempt. The fcth local simulated time models 
the LVT of the fcth logical process. Update attempts are 
simulated as independent Poisson-random processes, in 
which the fcth random time increment rjk (i.e., the ran- 
dom time interval between two successive attempts) is 
exponentially distributed with unit mean. A processor 
is allowed to update its local time only if the update is 
guaranteed not to violate causality. Otherwise, it remains 
idle. The time step t is the index of the simultaneously 
performed update attempt. It corresponds to an integer 
wall-clock time with each PE attempting an update at 
each value of t. Explicitly, in our model simulations the 
generic conservative update rule allows the fcth PE to 
update at any time step (t+1) if either of two conditions 
is satisfied. One: the randomly chosen lattice site is in 
the interior. Two: the randomly chosen lattice site is a 
border site and either of the following update conditions 
is satisfied: 

N=l : 7*(t)<min{7fc_i(t),7fe+i(t)}, (1) 
N> 2 : r k {t) < r r (t), (2) 

where r = k — 1 when the left border site is chosen 
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FIG. 1: The mapping of physical processes to logical pro- 
cesses considered in this work. The nearest-neighbor physical 
interactions (two-sided arrows in the left part) on a lattice 
with periodic boundary conditions are mapped to the ring 
communication topology of logical processes (two-sided ar- 
rows in the right part). Each PE carries N lattice sites, but 
communications take place only at border sites. In this study, 
each PE has at most two effective border sites. 
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FIG. 2: The growth and roughening of the STH for L = 100 
and N = 1: snapshots at ti (lower surface) and tu (upper 
surface). Here, t\ < t% <C t x m 3700. Local heights are in 
arbitrary units. 



and r = k + 1 when the right border site is chosen. 
Following a successful update attempt, the local simu- 
lated time is incremented for the next update attempt: 
Tfc(t + 1) = Tfe(t) + r]k(t). The random time increment 
rjk(t) is computed at each k and t as rjk(t) — — ln(r), 
where r G (0; 1] is a uniform deviate. The periodicity 
condition requires communication between the first and 
the last PEs in the chain: T£+i(f) = T\(t). In simulations 
we iterate either the update rule Q or the update rule 
(0), starting with the initial condition Tk(t = 0) = for 
all k. 

For the set of L processing elements, we define the 
STH as the set of L local simulated times at time step 
t. The mean height of the STH is given by the mean 
virtual time (r(t))i = l/tELi^W- Figure [21 presents 
the STH generated for a closed chain of L = 100 pro- 
cessors. As the time index advances the STH grows and 
roughens. The time evolution of the statistical spread of 
the interface is characterized by two distinct phases, the 
growth phase (when t -C t x ) and the saturation phase 
(when t > t x ), separated by the cross-over time t x . For 
a finite L, t x marks the transition to the steady state, 
where the average width of the interface is constant in 
time and is given by the power law (JVL) 1 / 2 [l7ll29| . 

To study the parallel efficiency, we define the utiliza- 
tion u(t) as the fraction of PEs that perform an update 
at the parallel time step t. The simulated utilization 
(u(t)) is computed as an ensemble average over many 
independent simulations. The time evolution of the sim- 
ulated utilization reaches a steady state (u(t)) — const, 
that depends on the system size (Fig. [5J: the steady- 
state utilization grows monotonically with TV. Note, for 
N = 1, according to the conservative update rule (JTJ, at 
t the update at the fcth PE site does not happen unless 
its cumulative local simulated time after (t — 1) steps is 
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FIG. 3: The time evolution of the utilization (u(t)) (averaged 
over K = 1024 simulations) for L = 10 and 10 4 with N = 
1, 10, 100. The result depends most strongly on N . 



not larger than the cumulative local simulated times at 
its neighboring PE sites. This means that an update at 
the /cth PE site corresponds to a local minimum of the 
STH at the fcth site. Accordingly, the mean utilization 
(u(t)) represents the mean number of local minima in 
the STH interface at t, averaged over many independent 
simulations. In an individual simulation, the utilization 
u(t) is the density of the local minima in the STH that 
is generated in this simulation. When N > 2 the utiliza- 
tion u(t) is the density of updating sites in the interface. 
It is important to distinguish between u(t) and (u(t)} 
as u(t) is characteristic of a particular class of the STH 
configurations while (u(t)) is the average measurement of 
u(t) taken over all possible configuration classes. In ana- 
lyzing the steady-state update statistics the steady-state 
utilization is denoted by u. 



III. STEADY-STATE UPDATE STATISTICS 

FOR TV = 1 

The STH of the generic PDES can be identified with a 
one-dimensional (Id) interface growing on a ring with the 
deposition of random time increments r/k in accordance 
with the deposition (update) rule given by Eq. QJ. The 
physics of this growth is reflected in the utilization. Be- 
cause the utilization is strictly related to the microscopic 
structure of the interface it is possible to make an explicit 
connection between the utilization and the morphology 
of the STH and to derive an analytical formula for the 
theoretical mean utilization as a function of the system 
size L. In this section we make this connection for N = 1 
when the STH growth has reached the saturation phase 
(i.e., when t > t x ). Our derivation of the update distri- 
bution makes the following two simplifying assumptions. 
First, we neglect correlations between nearest- neighbor 
local slopes. These depend on the type of deposition, 
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i.e., our derivation is not specific to the distribution from 
which the deposited % are sampled. This simplification 
is reflected in the assumption of equal statistical weights 
assigned to the legs of binary transition graphs that rep- 
resent possible choices of neighboring local sites. Second, 
we neglect temporal correlations among the groups of the 
surface configuration classes. Because of the above two 
simplifications, our theoretical result for the mean uti- 
lization is a mean-field like approximation to the mean 
utilization measured in simulations. 
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A. Theoretical utilization 

There are only four groups of elementary local site con- 
figurations of the STH that correspond to four mutually 
exclusive discrete events that take place at the kth PE 
site at t. These are as follows: "A" denotes an event 
when the update rule Q is satisfied from the left and 
from the right, i.e., when t^-i > and Tfc < Tfe+i; "B" 
denotes an event when the update rule JTJ is not satisfied 
from the right, i.e., when t^-i > 7fc and tj. > Tfe+i; "C" 
denotes an event when the update rule QJ is not satis- 
fied from the left, i.e., when Tfe_i < tx- and Tfc < Tfc+i; 
and, "D" denotes an event when the update rule is 
not satisfied from either side, i.e., when Tk-i < and 
i~k > Tfc+i- The corresponding elementary local config- 
urations of the STH at the fcth PE site are denoted by 
A, B, C and D (Fig. 0J. Because of the periodicity con- 
dition (i.e., tx + i = ri), during the steady state not all 
L sites can have the same elementary site configuration 
|35j . Therefore, in the set of L sites there must be at least 
one site with configuration A. Without losing generality, 
we assign the index k = 1 to one of the sites that are in 
the local configuration A and enumerate the other sites 
accordingly, progressing to the right. Its right neighbor 
(having index k — 2) can be only either in configura- 
tion C or in configuration D. Similarly, its left neighbor 
(having index k — L) can be only either in B or D. If 
site k = 2 is in configuration C then site k = 3 can be 
only either in configuration C or D. If site k — 2 is in 
D then site k = 3 can be only either in B or A. These 
choices are presented as transition graphs (binary trees) 
in Fig. |3J We adopt an approximation in which, during 
the steady state, the possible choices of transitions from 
the kth site to the right neighboring (k + 1) site are real- 
ized on average with equal frequency. Consequently, we 
assign equal statistical weights to each leg of the transi- 
tion graph in Fig. [SJ Starting from the site k — 1 and 
progressing to the right towards k = L, with the help of 
elementary transition graphs we can construct all possi- 
ble configuration equivalency classes of the entire surface 
generated by the deposition (update) rule QJ. These 
can be categorized into groups (called p-groups) based 
on the number p of the deposition (update) events at t, 
i.e., the number of local minima in the surface configura- 
tion (coded by "A" ) at t. The utilization of the p-group 
is u(p) = p/L. The probability distribution f(j>; L) of the 



FIG. 4: The four groups of elementary local surface configu- 
rations of the STH at the kth site. The index k denotes the 
fcth PE in the chain (N — 1). Each group corresponds to one 
of the four mutually exclusive discrete events A, B, C and D 
at an update attempt. "A" denotes an event when the update 
rule is satisfied. "B" denotes an event when the update rule 
is not satisfied from the right. "C" denotes an event when 
the update rule is not satisfied from the left. "D" denotes an 
event when the update rule is not satisfied from the left and 
the right. 
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k=L: B D 
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FIG. 5: Binary branching of possible choices in constructing a 
surface configuration from the elementary local configurations 
A, B, C, and D of Fig. 0] (a) The alternatives that must be 
followed starting with A at k = 1 and progressing towards 
k = L to the right, (b) The only possible alternative for a 
periodic chain closed at k — L: the left neighbor of site k = 1 
must have configuration either B or D. 



deposition (update) events is obtained as the quotient of 
the multiplicity M{p) of the p-group configuration class 
and the total number M of configuration classes [3jj . 

For example, the binary tree for the construction of 
possible surface configuration classes for L = 5 is shown 
in Fig. Looking along its branches, starting from the 
leading A at the fixed k = 1 position, it is easy to identify 
a total of eight possible configuration classes of the entire 
surface: 1. ACCCD; 2. ACCDB; 3. ACDBB; 4. AC- 
DAD; 5. ADBBB; 6. ADBAD; 7. ADACD; 8. ADADB. 
Note, according to the surface construction rule, the class 
representative #4 is not equivalent to the class represen- 
tative #7 because the leading A in configuration #7 has 
a local maximum as its right neighbor and configuration 
#4 does not have this property. If the assignment of 
an index to a site were irrelevant, all configurations that 
can be obtained under an even permutation of sites would 
have fallen into one equivalency class. The surfaces repre- 
senting configurations #1-8 are sketched in Fig. Each 
surface configuration represents a class of infinitely many 
topologically equivalent deformations because the de- 
posited random time increment is a real positive number 
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FIG. 6: Binary tree for the construction of all possible con- 
figurations of the surface for L — 5. A number to the left 
of the configuration symbol denotes the level of branching. A 
number in parenthesis to the right of the configuration symbol 
denotes the number of branching levels in a subtree. Notice 
the recurrent structure: the graph consists of the nested trees 
A(4), D(3), A(2), D(l). The dashed lines mark the transi- 
tion cuts to the lower level trees. A(l) and D(l) denote the 
one level branch A and D, respectively, that mark the end of 
branching. See discussion in Sec. IIH Al and Appendix lAl 




FIG. 7: The graphs of possible surface configuration classes 
that correspond to the configurations read along the branches 
from Fig. El 1. ACCCD; 2. ACCDB; 3. ACDBB; 4. AC- 
DAD; 5. ADBBB; 6. ADBAD; 7. ADACD; 8. ADADB. 
Each graph represents a class of infinitely many topologically 
equivalent deformations. 



that can take on continuous values in the interval [0; oo). 
There are only two p-groups. In the first group there are 
four classes with one letter A: Af(l) = 4, /(1;5) = 1/2 
and it(l) = 1/5. In the second group there are four 
classes with two letters A: M(2) = 4, /(2;5) = 1/2 
and u(2) = 2/5. Thus, for L = 5 the mean utiliza- 
tion that is measured during steady-state simulations is 
(u{L = 5;N = 1)> = /(l; 5)u(l) + /(2; 5)u(2) - 3/10. 

For general L, the utilization measured in simulations 
during the steady state is the mean frequency of the local 
surface minima, averaged over all admissible surface con- 
figurations. It can be obtained from the generally valid 
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FIG. 8: The recurrent structure of the binary tree in con- 
structing the classes of surface configurations for general L. 
The meaning of the symbols is the same as in Fig. For 
general L, the highest level tree is 1A(L — 1) that has 2 L_2 
branches. Each branch represents a class of surface config- 
urations. The branches are categorized in distinct groups. 
Each group contains configurations with exactly p repetitions 
of A(l). The smallest p is 1, the largest p is [41 . The utiliza- 
tion in each group is p/L. 



formula for the computation of averages: 



(«(L;JV))=53/(p;i)u(p), 



(3) 



where the summation extends over all p-groups of the 
admissible surface configuration classes, u(p) is the uti- 
lization characteristic for each group, and f(p; L) is 
the frequency of the occurrence of p-group during the 
steady state. To find the theoretical f(p;L), one can 
exploit the recurrent structure of the corresponding bi- 
nary tree (Fig. |SJ) in counting the classes of the sur- 
face configurations (branches) that contain the elemen- 
tary site configuration A at exactly p number of sites, 
p = 1,2,3, ... ,p max = [§] ([f] denotes the integral 
part, which is L/2 for even L and (L — l)/2 for odd L). 
The details of the derivation are given in Appendix ^] 
The total number of configuration classes is M = 2 L ~ 2 . 
The number of branches with exactly p occurrences of A 
is M(p) = {L- 1)!/ ((2p - 1)! (L - 2p)\). The frequency 
of occurrence of the p-group is f(p; L) — M(p) /M. Thus, 
the theoretical mean utilization of the steady state is 



M*i»-^g(£:i)! 

1/2 ,L = 2 
(£ + l)/4L ,L > 3 



(4) 



The theoretical utilization (u(L; 1)) is bounded from be- 
low by (u(L -► oo; 1)) = 1/4 (Fig.©. 

In classifying individual configurations, the underly- 
ing principle is provided by the deposition rule given 
by Eq. QJ. Therefore, the local A-configuration rep- 
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Eq. ©, where f(p; L) is estimated from the steady-state 
simulation data. Denoting by G(p; L) such an "experi- 
mental" frequency, we write explicitly 



(u(t)) 



K 



1 K [~\ 

- 2 «(<,*)= 2 G(p,t)«(p), 



(5) 



i=l 



p=l 



where the rhs follows simply from grouping the summa- 
tion terms. This is possible because u(i, t) takes on only 
the values u(p) that characterize the p-group of the sur- 
face configuration classes. Having a sequence of mea- 
sured frequencies G(p; L) over the steady-state time in- 
terval, the time average (G(p; £))t can be computed for 
each p. After time averaging, Eq. (J3J) gives 
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FIG. 9: The steady-state mean utilization as a function of the 
system size for N = 1. The continuous curve represents the 
analytical result (Eq. It converges to lim.L_ 00 {u(L; 1)) = 
1/4 (horizontal line). The circles represent the utilization 
measured in simulations, with error bars smaller than the 
symbol size. 



resents four types of update events, and the local con- 
figuration B (or C) represents two types of no-update 
events (Fig. . The small differences between the simu- 
lation results and Eq. (0J, clearly observed in Fig.|5J come 
mainly from neglecting temporal correlations among p- 
groups of surface configuration classes in our derivation. 
These correlations are intrinsically present in the com- 
putation of averages over time scries in simulations but 
are absent in our model. They depend on the type of 
deposition, i.e., the probability distribution from which 
the random time increments rjk are sampled. A possi- 
ble second source of discrepancies is the assumption of 
equal statistical weights in the transition graphs (Fig. [SJ . 
When the actual weights are only approximately equal, 
this modifies the frequency f(p; L) of the occurrence of a 
p- group in Eq. yj, so a particular surface configuration 
class may occur slightly more (or less) often in simula- 
tions than would result from our assumption. Note, this 
modifies only /(p; L); the utilization u(p) of a p-group is 
not changed. In deriving f(p; L) the underlying assump- 
tion implies that any class of the entire surface configu- 
rations is equally probable. The factor 1/M = 1/2 L ~ 2 in 
Eq. (0J has the meaning of this probability ( Appendix lA"|) . 



(u) 



K,T 



£<G(p)> T u(p). 



(0) 



The corresponding statistical spread of the measured av- 
erage utilization 5(u), i.e., the standard deviation of the 
mean (u(L)}, can then be determined from the measured 
standard deviations of G(p; L): 
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B. Computation of averages 

In simulations, the average steady-state utilization is 
measured at each t as the arithmetic average over an en- 
semble of K independent simulations and then averaged 
over a series of T time steps during the steady state. At 
each t, this is equivalent to the computation of averages 
over the surface configuration classes in accordance with 



FIG. 10: The probability distribution for L — 50: the theo- 
retical f(p;L) (histogram) and {G(p;L))t measured in simu- 
lations (symbols). The error bars represent one standard de- 
viation from the mean of the measured time sequence at satu- 
ration (the quantity 5G(p) that enters Eq. J7J). The measured 
frequencies were obtained from an ensemble of K — 2048 in- 
dependent simulations as K(p)/K, where K(p) is the number 
of trials that produced the p-group of the surface configura- 
tion classes. 
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FIG. 11: The time sequence of frequencies of the surface configurations characterized by the utilization u(p) = p/L. The 
continuous horizontal lines represent the theoretical f(p;L). Symbols are simulation data G(p;L). The dashed horizontal 
lines represent time averages (G(p; L))t over an interval of 1000 t-steps, beginning at t — 10 8 . The error bars represent one 
standard deviation from {G(p; L))t as in Fig. EH The dat a were taken in K = 1024 simulations: (a) For L = 4: (G(1;4))t = 
0.7323 ± 0.0138, (G(2; 4)> T = 0.2677 ± 0.0138; (b) For L = 11: (G(l; 11)) T = 0.0268 ± 0.0051, (G(2; 11)) T = 0.2583 ± 0.0144, 
(G(3; 11)) T = 0.4759 ± 0.0157, (G(4; 11)) T = 0.2194 ± 0.0134 and (G(5; 11)) T = 0.0194 ± 0.0044. 



where SG(p) denotes the empirical standard deviation 
of the G(p; L) time sequence. At each t, the frequen- 
cies G(p\ L) are found by directly counting the simu- 
lations that produced u(p) = p/L and, subsequently, 
computing the quotient of this count K(p) and the to- 
tal number K of simulations in an ensemble. Explic- 
itly, for p — 1,2, •••)[■§]) the measured frequency is 
G(p;L) = K(p)/K, where K = J2 p K(p) (Fig.EJ. " 

A typical time sequence of G(p;L), measured in K = 
1024 independent simulations, is shown in Fig. ^] For 
L = 4, the theoretical steady-state frequencies, /(1;4) — 
3/4 and /(2;4) = 1/4, differ slightly from the aver- 
ages (G(p;L))t ± SG(p) computed over an interval of 
T = 1000 steps, beginning at t = 10 8 . The measured 
steady-state utilization (u(L; 1)) ± 5(u(L)) is (u(4; 1)) = 
0.3169 ± 0.0077. Similarly, for L = 11 the mea- 
sured frequencies are in close agreement with the theory: 
/(l; 11) = /(5; 11) = 5/256, /(2; 11) = /(4; 11) - 15/64 
and /(3; 11) = 63/128. The measured steady-state uti- 
lization is: (u(ll; 1)) = 0.2678 ± 0.0073. The theoret- 
ical utilizations (u(4;l)) = 5/16 and (u(ll; 1)) = 3/11 
(from Eq. (0J) compare with the utilizations measured 
in simulations well within the statistical error bars when 
K = 1024, 2048; likewise, there is very good agreement 
for general L. However, when K — 4096 the statistical 
spread S(u(L)) is small enough to see that the results of 
Eq. Q lie above the simulation data in Fig. E| 

The standard deviation of the distribution of u(p) 
among admissible p-groups of the surface configuration 
classes can be measured directly in simulations as the 



square root of the variance var(u): 

var(u) w (u 2 ) A ',t - (u)% >T (8) 

([t] \ 2 

£<G(p;£)>ru(p) - 
p=i J 

where the rhs comes from grouping terms in the sum- 
mations. The statistical spread in u(p) is the largest for 
L = 4 and decreases when L increases. Equation (JSJ 
gives the measured average variance of the probability 
distribution of updates in the system, scaled by L 2 . 

C. Probability distribution of updates 

The derived theoretical probability distribution of up- 
dates in a closed linear chain of L processors, each car- 
rying N — 1 lattice sites and following the conservative 
update rule, is 

where p is the number of updates at the tth update at- 
tempt in the steady-state simulation. In an equivalent 
interpretation, Eq. @ gives the probability of generat- 
ing a surface with p local minima at saturation when the 
surface growth obeys rule In other words, in the lat- 
ter interpretation f(p; L) is the probability distribution 
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of the deposition events on the surface. Equation JJjJ, 
derived in Appendix^ was already used in calculating 
(u(L; 1)) in Eq. J3J. FigurelT^lpresents f(p; L) for various 
system sizes. It can be seen that the measured utiliza- 
tion is (u) = (p) /L, where (p) is the mean of /(p; L). The 
variance a 2 of f(p; L) (and, thus the standard deviation 
of u) can be obtained in the usual way ( Appendix IB")) as 



0.11 



a 2 =^2( m - (p)) 2 f(m;L) 



L-l 
16 



(10) 



for L > 4, and a 2 — for L — 2, 3; where 

( P 2 )^m 2 /( m ;L) = M (11) 



for L > 4, and (p 2 ) = 1 for L = 2, 3; and 



(P> = ^2mf(m;L) 



L + l 



(12) 



for L > 3, and (p) = 1 for L = 2. Also, using Eq. (U2J, it 
can be derived that (1/p) = 1 for L = 2, and for L > 3: 



p / z — ' m 

= I^ij(E(*-i)/(^ + »-^) 



4 'i ! 



L 2 L - 1 /' 



and for L > 4: 



<P 3 > - 
(P 4 > = 



i 3 + <6L 2 + 3L - 2 
64 ' 
L(L 3 + 10L 2 + 15L - 10) 
256^ ' 



(13) 

(14) 
(15) 



The form of Eq. © permits exact computations of all 
moments of f(p;L). The corresponding recursion for- 
mula is given in Appendix^] In this way we find that the 
skewness of f(p; L) is zero, which means that f(p; L) is 
symmetric about (p) . The computed kurtosis (the fourth 
central moment) is strictly positive for L > 4, which 
means that f(p; L) is more pointed than a Gaussian. 

The theoretical standard deviation of the utilization 
distribution among various p-groups, i.e., the statistical 
spread of u(p) in an ensemble of independent simulations, 
can be computed from Eq. illOfl : 



4L 



(16) 



In an equivalent interpretation, Eq. (|Td|) gives the distri- 
bution of u(p) among various p-groups of surface config- 
uration classes. 

The statistical distribution of the updates in the sys- 
tem of L processors can also be estimated directly from 




50 100 150 200 250 300 350 400 450 
P 

FIG. 12: The probability distribution f(p; L) of p updates 
in a closed linear chain of L PEs, each carrying one lat- 
tice site and following the conservative update rule. L = 
250, 500, 1000, 1500 (from left to right). 



the simulation data, without any presupposed underlying 
model. It is sufficient to notice that the update at a PE 
site happens when the STH has a local minimum at that 
site and that the Id surface of L sites may have no more 
than [|f] local minima. Denoting by K(p) the number 
of simulations that produced surfaces with exactly p lo- 
cal minima in the sequence of K independent simulations 
(p and K(p) are directly counted in simulations), at any 
t the "experimental" distribution is G(p;L) = K(p)/K. 
Its time average in the steady state is (G(p; L))t- The 
variance of (G(p; L))t can be obtained directly from the 
simulation data as described in Sec. IIII Bl (by the rhs 
of Eq. JSJ multiplied by L 2 ). In the infinite iGlimit 
(G(p;L)}t converges to the exact steady-state distribu- 
tion g(p;L). At any t the average of any observable Q 
that depends on the number of local minima can be eval- 
uated as 



A' 



(Q)k « ^EQ(r) 



P =i 



. f2j / K(p) 



K(p), 



(17) 



where (-)k( p ) is the mean over the measured configura- 
tion classes in the p-group. The exact mean in the steady 
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state is 

[#] 

(Q)=J2g(p;L)Q( P) L), (18) 
P =i 

where Q(p, -L) is the value typical for the p-group. 

The question that naturally arises at this point is 
how close the theoretical distribution f(p;L) given by 
Eq. © represents the exact distribution g(p; L) that en- 
ters Eq. Q18JI. The results for L = 50 obtained with 
K = 2048, presented as a histogram in Fig. 1101 show 
that f(p; L) mimics the overall shape of the experimen- 
tal {G(p; L))t very well. Increasing K to 4096 improves 
the error bars of (G(p; L))t (in particular, for the ex- 
treme values of p); yet, as K gets larger the overall 
shape of (G(p; L))t remains unchainged and the differ- 
ence (G(p;L))x — f(p;L) does not entirely vanish. This 
difference is most pronounced when L is small and is 
largest for L — 4. Since for L = 4 there are only two 
values of p, it is easy to estimate the exact g(p; L) to a 
high degree of confidence. For K = 1024: (G(1;4))t = 
0.73226 ± 0.01378, (G(2;4)) T = 0.26773 ± 0.01378. For 
K = 5120: (G(l; 4)) T = 0.73186 ± 0.00611, (G(2; 4)) T = 
0.26813 ± 0.00611. As K increases (G{p;L)) T converges 
to: 5(1; 4) w 0.732, g(2;4) w 0.268; while the theoretical 
estimate is: /(1;4) = 0.75, /(2;4) = 0.25. The small 
difference g(p; L) — f(p; L) for the worst case of L = 4 
indicates that Eq. © is a close approximation to the 
time-averaged exact distribution g(p; L). We believe, the 
primary reason for this difference is the lack of tempo- 
ral correlations among various p-groups in our computa- 
tional model, as was pointed out in Sec. 1111 Al Regardless 
of this simplification, the theoretical standard deviations 
(j(u) given by Eq. i|16|) agree with a(u) computed directly 
in simulations via Eq. JSJ. 



IV. PERFORMANCE OF A CONSERVATIVE 
ALGORITHM 

The computational speed-up s of a parallel algorithm 
is defined as the ratio of the time required to perform a 
computation in serial processing on one PE to the time 
the same computation takes in parallel processing on L 
processors. It is easy to derive from the above definition 
that for an ideal system of processors the computational 
speed-up is the product of the number L of PEs in the 
system and the utilization (u(L;N)) of the parallel pro- 
cessing environment: s — L(u(L; N)}. In other words, 
the speed-up is measured by the average number of PEs 
that work concurrently between two successive update 
attempts. 

We observe that for ideal PEs the speed-up as a func- 
tion F{L) must be such that the equation F(L) = s has 
a unique solution, where s is a fixed positive number. 
This requirement follows naturally from the logical ar- 
gument that distributing the computations over L ideal 



processors gives a unique speed-up, i.e., two ideal sys- 
tems having sizes L\ and L2, respectively, may not give 
the same s. This means that F(L) must be a monotoni- 
cally increasing function of L. 

In our model for N = 1 the average number of PEs that 
work in parallel, i.e., the speed-up, is the mean number 
of local minima in the STH during steady-state simula- 
tions. In Section l!!! Cl this number is computed explicitly 
as the first moment of a theoretical distribution given by 
Eq. In this way, translating Eq. 11211 to the language 
of applications, the theoretical speed-up that can be ob- 
tained in an ideal system of PEs performing conservative 
PDES in the ring communication topology with N = 1, 
is given by: 

S= { (L + l)/4 [l>4 3 ■ (19) 

In what follows we discuss the consequences of Eq. <|19[) 
(or its alternative Eq. from the point of view of ap- 
plications. 

One of the consequences is that the theoretical upper 
bound for (u(L; 1)) is 1/2. This corresponds to the sit- 
uation when only one of the two PEs is working at a 
time while the other one idles. In the picture when the 
simulations represent operations performed by a paral- 
lel algorithm, when L = 2 or L = 3 the parallelization 
within the conservative update scheme does not give an 
advantage in terms of the computation time because the 
processors work alternately. For an ideal system of PEs, 
where communications between PEs take place instan- 
taneously, such an operation will not produce speed-up. 
For a real system of PEs, the communication overhead 
will produce an actual slow-down, i.e., the parallel exe- 
cution time will be longer than the sequential execution 
time on one PE. Between the update attempts during the 
steady state the average number of PEs working in par- 
allel is L{u(L; 1)) = (L + l)/4. This means, when L = 4 
or 5 the actual number of working PEs is still either 1 
or 2; and when L = 6 or 7 the actual number of working 
PEs is 1,2, or 3 at a time and on average this actual 
number is still either 1 or 2. This will produce a small 
speed-up for an ideal system of PEs, but for a real system 
of PEs this speed-up may be negligible or not present at 
all. A noticeable advantage in terms of speed-up should 
be expected when the average number of working PEs is 
(L + l)/4>2, which gives L > 8. For a real system of 
PEs the best speed-up will not be larger than the average 
speed-up for an ideal system, i.e., not larger than s given 
by Eq. 1(15). 

The parallel utilization efficiency (u(L; N)) and the 
speed-up, as measured by the average number L(u(L; N)) 
of PEs working in parallel, depend on the number N of 
lattice sites per PE, as well as on the number N}, of border 
lattice sites per PE, and on the communication topol ogy 
among the PEs. Our earlier large-scale simulations [2(j 
show that the worst-case scenario of N = 1, studied in 
this work, can be greatly improved when N is increased 
while retaining the ring communication topology with 
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the effective N b — 2 (Fig. 01. On the other hand, the 
case of Nb = 2 seems to have limited applications and 
should rather be considered as an intermediate model 
towards the study of more realistic conservative PDES 
where the effective Nb may be arbitrary. From this per- 
spective the case of Nb = 2 is really the best case sce- 
nario since the utilization declines with increased number 
of communications between PEs, i.e., when the effective 
Nb increases. In the ring communication topology, it can 
be expected that the actual speed-up in the ideal system 
of PEs should be larger than it is in the worst-case sce- 
nario with N = 1, but smaller than it is in the best-case 
scenario with large N and Nb = 2. The actual speed-up 
in the real system of PEs should not be larger than a 
theoretical upper bound for the ideal system. Deriving 
a theoretical estimate for this upper bound requires first 
finding a closed expression for (u(L; 2)), a problem that 
is still awaiting solution. 



V. DISCUSSION 

We showed in Sec. lHII that for steady-state simulations 
for a closed chain it is possible to explicitly correlate de- 
position statistics with the surface morphology on a mi- 
croscopic scale. In our approach we used simple laws 
of statistics to build distinct configuration classes of the 
virtual time horizon. For one particular rule of surface 
growth, we constructed binary trees from which we could 
read the surface equivalency classes, serving our purpose 
of counting a particular type of configurations, relevant 
in deriving an approximate analytical expression for the 
measured utilization, i.e., the measured frequency of the 
time deposition. In the case solved in this work, the sum- 
mation process was technically easy once the symmetries 
in the binary graphs became apparent (Fig. [HJ) - In princi- 
ple, our method may be applied to any surface that grows 
on a lattice by a known growth rule, and it can be gener- 
alized to any measurable quantity Q. When the growth 
recipe is available it should be possible to construct dia- 
grams of elementary site configurations and to translate 
the growth rule to dependences among the graphs. An 
observable Q can then be computed as the average over 
all available groups of classes of the surface configura- 
tions: Q = f(p)q(p), where f(p) has the meaning 
of the probability distribution and q{p) is the value of 
Q characteristic for each group. In the example given 
in this work, the statistical weights assigned to each leg 
of an elementary transition diagram (Fig. 0) were taken 
to be equal. However, a different deposition rule may 
require a different choice of elementary diagrams and a 
different choice of statistical weights. The present appli- 
cation of the method to the Id deposition problem on 
the STH surface is a promising example that could be 
generalized to a variety of other growth processes. 

In a common approach one finds the universal proper- 
ties of growing interfaces from a stochastic growth equa- 
tion that is solved in a coarse-grained approximation at 



large scales. One powerful technique is the renormaliza- 
tion group approach [2l|, y(J, LiH |32| • The coarse-grained 
solution of the stochastic dynamics provides asymptotic 
scaling properties in the limit of large system sizes. Be- 
cause of its continuum nature this method is not capable 
of giving a detailed microscopic description of interfaces 
such as, e.g., the probability distribution of events on the 
growing surface. When applied to the model of conserva- 
tive PDES studied in this paper, the continuum method 
does not give any estimate of the utilization of the par- 
allel environment and the speed-up. Neither does it give 
the scaling behavior for the utilization in the limit of in- 
finite system size. In earlier work, Korniss et al. [T?) 
used a coarse-grained method to determine that, in the 
steady state for N = 1 , the conservative STH is governed 
by the Edwards- Wilkinson Hamiltonian 33], which im- 
plies a non-zero utilization in the infinite L limit, i.e., the 
asymptotic scalability of a generic conservative PDES. 
This finding explained the observed tendencies in the 
time evolution of the large-scale simulation data for the 
utilization, which clearly showed that the steady-state 
mean utilization settles down at a non-zero value, slightly 
lower than 1/4. The simplified analysis of the micro- 
scopic structure of the conservative STH at saturation, 
presented in this work, enabled us to derive the analytical 
formula for the utilization, Eq. jljl. in the approximation 
where correlations among various surface configuration 
classes are absent during the steady state. Equation Q 
provides the explicit scaling relation for the utilization, 
which shows directly the asymptotic scalability of conser- 
vative PDES. The limiting value lim i _ roo (u(L; 1)) = 1/4 
coincides with the estimate in Ref. [ljj • The actual sim- 
ulated values for the mean utilization {u(L; 1)) fall below 
the analytical curve (Fig. El). This small, but statistically 
significant, difference is evidence for small spatial and 
temporal correlations inherently present in the simula- 
tion data during the steady state. Theoretical treatment 
of these correlations, which would provide a correction to 
the theoretical distribution derived in this work, remains 
to be explored. 

The closed form of the event distribution during 
steady-state simulations (Eq. 10) enables one to com- 
pute analytically the mean of any observable that de- 
pends on the number of local minima in the STH surface 
(Eq. In this way we derived the explicit expressions 

for (p n ) (Eqs. (JDJ-JTSJ) that are valid for all values of L 
within the adopted model. When N = 1, the measured 
mean number of local minima (p) directly translates to 
the utilization and to the speed-up of conservative PDES. 
This approach has an advantage over any of the common 
continuum methods because it gives not only the exact 
scaling relations in the limit of large system sizes, but 
also enables one to compute analytic quantities valid for 
any system size. While the asymptotic properties of a 
PDES (such as, e.g., the scalability as L increases) are 
of theoretical interest, the explicit formulas for the per- 
formance evaluation of a PDES for finite L and N are of 
practical value in algorithm design. 
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When the number of lattice sites per PE is N > 3, 
each PE carries two kinds of sites, interior sites and bor- 
der sites. This generates two groups of mutually exclusive 
update events: updates when an interior site is randomly 
chosen and updates when a border site is randomly cho- 
sen. These groups of events are mutually exclusive be- 
cause during the tth update attempt the drawing of a 
lattice site is performed only once per PE. Finding the 
mean fraction of PEs that made an update while a bor- 
der site was selected requires solving the case of N = 2. 
This case is different from the case of N = 1 because 
the update rule changes. Now, at each t we first ran- 
domly select a border site on the fcth PE, and when the 
neighboring PE of the selected site has its local simulated 
time larger than t& the fcth PE makes the update. This 
"one-sided" rule changes the deposition pattern (the STH 
growth rule) because now the fcth PE site does not need 
to be in elementary configuration A to make an update. 
A new rule implies a new definition of p-group config- 
uration classes that now is characterized by p updates 
(for N = 1 it is defined by p local minima). Also, the 
way in which new frequencies f(p; L) are computed must 
incorporate a random choice of border sites. We leave 
the questions of deriving update distributions for N > 2 
open for future investigations. 



VI. SUMMARY 

We simulated the performance of an ideal closed chain 
of PEs that work in parallel in an asynchronous manner, 
with updates following a generic conservative algorithm. 
The conservative update rule can be seen as the mecha- 
nism that determines the growth of the virtual time sur- 
face of a conservative process. The physics of this growth 
is reflected in the utilization and in the interface width. 

We showed that it is possible to make an explicit con- 
nection between the steady-state utilization and the mi- 
croscopic structure of the virtual time interface at satu- 
ration. We exploited this connection to derive an ana- 
lytical formula for the probability distribution of the up- 
date events in the system within an approximate model. 
Then, having the model probability distribution, we com- 
puted explicit expressions for the mean utilization and 
the computational speed-up as functions of the system 
size. Our result states that the speed-up for the ideal 
closed chain of PEs, each carrying one lattice site, grows 
linearly with the system size as s = (L + l)/4 for L > 4, 
and s — 1 for L — 2, 3. 

Finally, we observe that our approach could be applied 
to a variety of other growth processes. In this sense, the 
present Id application to the update problem in conser- 
vative PDES is a promising example. The main advan- 
tage of the approach is that it enables one to compute 
analytically quantities that otherwise can be only esti- 
mated qualitatively in an asymptotic fashion by contin- 
uum methods. 
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APPENDIX A: DERIVATION OF f(p; L) 

To find the theoretical frequency f(p;L) — M(p)/M, 
one should compute the number M{p) of surface config- 
uration classes that contain the elementary configuration 
A (a local minimum of the STH) at exactly p sites. Both 
M(p) and M (the total number of configuration classes) 
can be easily found by simply counting the branches of 
the binary trees presented in Fig. |S| 

For a given L, we start with the highest level tree 
1A(L - 1) (the left tree in Fig. [SJ|. This tree has 
{L — 1) branching levels. The 1A(L — 1) tree branches 
to 2C(L - 2) and 2D(L - 2) (the right tree in Fig. |SJ>, 
which have (L — 2) branching levels, etc. The branch- 
ing ends up at (L - l)A(l), (L - l)B(l), (L - l)C(l) 
and (L— l)D(l) that do not branch. Therefore the total 
number of branches is M = (1/2) W^Zl 2 = 2 L ~ 2 . 

The factor 1/M is the probability that any class of 
the entire surface configuration appears in an individual 
simulation at t. Assuming that each leg of the binary 
branch in Fig. [S] carries the statistical weight of 1/2, this 
probability is 1/2™, where n = L— 2 is the highest binary 
branching level. 

To find M(p), notice (Fig. Eland Fig. |BJ that in each 
A-tree there is at least one sub-branch that contains no A 
(along the C-branch) and in each D-tree there is exactly 
one sub- branch that contains no A (along the B- branch) . 
Let n(p; fcX(r)) denote the number of branches that con- 
tain exactly p number of As in the sub-branch fcX(r). 
Here, "X" stands either for "A" or "D" . In this notation 
the "exactly one branch with no A in the D tree" means: 

n(0;kD(r)) = 1. (Al) 

The fcth level A-tree has n(l; fcA(r)) branches that con- 
tain exactly one A: 

r-i 

n (1; kA(r)) = 1 + ^ n (0; (fc + r - s) D{s)) . (A2) 

s=l 

The meaning of Eq. I|A2(1 can be clarified by the exam- 
ple of the tree in Fig. [fJl The first branch-cut (the left 
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dashed line) separates the C sub-branch from the remain- 
der. The C sub-branch (to the left of the first cut) has 
only one (leading) A. Therefore, there is 1 on the rhs of 
Eq. HA2jl . The remainder (to the right of the first cut) 
is the sum of the three D-trees: 4D(1), 3D(2),and 2D(3). 
Therefore, the sum in Eq. (|A2|I has three terms. Since the 
remainder has already one A in the leading position, the 
sum must contain only the sub-trees that contain p = 
number of As. In a similar fashion we obtain for the D 
tree: 



Eq. 1A4(> . after substituting Eq. ijAlfl . gives: 
M(2) = n(2;lA{L- 1)) 



L-2 s-1 



l-l 



EE h + E 1 



s=l 1 = 1 

L-3 



k(k + 1) 



m=X / 

L-4 



E 



E 

k=l k=0 
L-l 

3 



k + 2 



(A6) 



(1; kD{r)) = n (1; (k + r - s) A(s j) . 



(A3) 



For p = 3 the iteration leads to: 

M(3) = n(3;lA(L- 1)) 



L-2 



where the summation extends over all A-trees because 
the left sub-branch of the D tree contains no As (Fig. |SJ 
Fig.©. 

To find n(2; /cA(r)), one must first sum over all D-trees 
(to the right of the first cut in Fig. [B] or along the left 
graph in Fig. |HJ and then sum over all A-trees (to the 
right of the second cut in Fig. or along the right graph 
in Fig. HI: 



'£n(2;(L-s)D(s)) 

s=l 

L-2 s-1 

£5>(2;(L-0A(0) 

s=l 1=1 

L-2 s-1 l-l r-1 / k-1 

EEEE^ + E 1 

s=l 

L-5 

E 



s=l 1=1 r=l k=l 
L 



m=l / 

k(k + l)(fc + 2)(fc + 3) 



n(2;kA(r)) = ^ n (1; (k + r - s) D{s)) (A4) 

s=l 

J — 1 s — 1 

= EE n d;(fe + r-0^(0) 

s=l 1=1 

r-1 s-1 / l-l 

= 1+ E n(Q;(k + r-m)D(m)) 



k=l 

L-6 

E 

fc=0 



2-3-4 



k + A 



L - 1 
5 



(A7) 



where we used Eq. (IA4|I and Eq. l|Aip . For general p = 
1,2,..., \U] , we obtain: 



s=l 1=1 



m— 1 



In the last step of Eq. l|A4jl we used Eq. (|X2|) . 

In summary, in counting the branches with exactly p 
number of As we utilize the recurrent structure of the 
binary trees presented in Fig. [S] We perform the branch 
cuts that mark the transition from a higher level tree to 
a lower level tree and iterate along the branch-cut lines. 
The iteration is completed when the summation term has 
a form given by Eq. IjAlfl . In this way, continuing from 
Eq. HA2|). we obtain: 



M(l) = n (1; 1A(L - 1)) 

L-2 



M(p) = n(p;lA(L- 1)) 

L-2 fc2 P -3-l 

- E- E 

ki — 1 fc 2 p-2— 1 

L - 2p 'k + 2p-2 



1 



E 



k=0 



k2p-2 — 1 

E i 

k=l 

L - 1 
2p- 1 



(A8) 



2p- 2 

APPENDIX B: MOMENTS OF f{p; L) 

For the purpose of the computation of moments we 
introduce the following notation: 



(p")L = ^m"/(m;i), 



(Bl) 



rn — 1 



= l + ^n(0;(L- s)D(s)) 

s=l 
L-2 

= 1 + ^1 = L-l, 



(A5) 



where n = 0,1,2,.... By manipulating the factorials in 
/(m; L) and shifting the summation index on the rhs of 
Eq. (EH; it is straightforward to show that: 



s=l 



where we used Eq. (|A1I) . Similarly, continuing from 



{pn)L= L(LTT) E (fc"l)" +1 (2fc-l)/(fc;L + 2). 

(B2) 
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With the substitution L — > L — 2, Eq. (|B2|I becomes a 
recursion relation for (p™)^: 

< P "U- 2 (L " 2) 8 (L " 1) =(( P -1)" +1 (2P-D) L . 

(B3) 

Using the binomial formula in Eq. (|B3|I . the recursion 
relation can be explicitly written out as: 



(P n+2 )L = (p n )L-2 



(L-2)(L-1) , (p 



n+l\ 



4 2 



(B4) 

(pV 



i=0 ^ 



To obtain (p n )z, for arbitrary n, we iterate Eq. (|B4|I . 
starting with the initial n = and using the identities: 



(p°)l = 1, 



L + l 



(B5) 
(B6) 



Equation l|B5|) expresses the normalization condition. 
Equation l)B6|) follows from Eq. (|B1|I after simple al- 
gebra and from Eq. i|B5l) . 
For n = 0, Eq. (|B4l) gives: 



For n = 1, Eq. (|B4|> gives: 
<P 3 }l 



2 . _L(L + 3) 



42 



(L - 2)(L - 1 i( P V-2 



(B7) 
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5(p 2 ) 



(p° 



(B8) 



Substituting Eqs. l|B5|) - (|B7|) leads to: 



(P 1 



L 3 + 6L 2 + 3L-2 
L - 43 - 



(B9) 



In a similar fashion, for n = 2, Eqs. (|B4|I - (|B7J| and 
Eq. I|B9|I give: 



h L (L 3 + 10L 2 + 15L - 10) 



<P 4 )l = 



4 4 



(BIO) 



The variance a 2 , the skewness, and the kurtosis of f(k; L) 
can be computed in the standard way |34|. The vari- 
ance is given in Eq. I|1U|) . For the skewness we obtain 
skew(/) = 0. The kurtosis is a positive function of L. 
Explicitly, for L > 4: 



kurt(/) = 2 



3L 3 + 6L 2 - 10L + 1 



(Bll) 



Equation (|13fl is derived in a similar way as Eq. (|B3|I . 

by simple algebra and by shifting the summation index. 
An arbitrary power (p~ n ) can be obtained by deriving 
the corresponding recursion relation, following the lines 
outlined above for n > 0. 
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